Significantly Lower Entropy Estimates for Natural DNA Sequences
Authors
Abstract
If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign two bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than fivefold (to 1.66 bits) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts, whose predictions are then combined into a single prediction. The structure of this combination is novel, and its parameters are learned using Expectation Maximization (EM). Experiments are reported using a wide variety of DNA sequences and are compared whenever possible with earlier work. Four reasonable notions of the string distance function used to identify near matches are implemented and experimentally compared. We also report lower entropy estimates for coding regions extracted from a large collection of nonredundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: (i) predict the next amino acid based on inexact polypeptide matches, and (ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of nonredundant coding sequences.
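To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' model. It shows (a) the frequency-only entropy estimate that serves as the baseline (the figure quoted as 1.95 bits), and (b) how EM can fit the weights of a plain linear mixture of experts. The abstract describes the authors' combination structure as novel, so the simple global-weight mixture here is an assumption made purely for illustration. The function names and the `expert_probs` layout are hypothetical. For reference, the quoted figures imply that the gap over the baseline grows from 1.95 - 1.90 = 0.05 bits to 1.95 - 1.66 = 0.29 bits, a factor of about 5.8.

```python
import math
from collections import Counter


def order0_entropy(seq):
    """Bits per symbol when only single-symbol frequencies are used."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mixture_bits(expert_probs, weights):
    """Average code length (bits per symbol) of the weighted mixture."""
    total = 0.0
    for row in expert_probs:
        total -= math.log2(sum(w * p for w, p in zip(weights, row)))
    return total / len(expert_probs)


def em_mixture_weights(expert_probs, iters=50):
    """Fit global mixing weights by EM (illustrative assumption, see above).

    expert_probs[t][k] is the probability that hypothetical expert k assigned
    to the symbol actually observed at position t (assumed to be > 0).
    """
    k = len(expert_probs[0])
    weights = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(iters):
        resp_sums = [0.0] * k
        for row in expert_probs:
            mix = sum(w * p for w, p in zip(weights, row))
            for i in range(k):
                # E-step: responsibility of expert i for this position
                resp_sums[i] += weights[i] * row[i] / mix
        # M-step: new weight is the average responsibility
        weights = [s / len(expert_probs) for s in resp_sums]
    return weights


if __name__ == "__main__":
    # A perfectly uniform sequence needs the full 2 bits per nucleotide.
    print(order0_entropy("ACGT" * 10))  # 2.0
    # Toy data: expert 0 is usually confident and right, occasionally wrong.
    probs = [[0.9, 0.25]] * 3 + [[0.05, 0.25]]
    w = em_mixture_weights(probs)
    print(w, mixture_bits(probs, w))
```

Each EM iteration cannot increase the mixture's average code length, so the fitted weights code the toy data at least as compactly as the uniform starting mixture. In a real predictor the experts would be derived from inexact matches at various distances, as the abstract describes; here they are fixed toy numbers.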
Similar resources
Significantly Lower Entropy Estimates for Natural DNA Sequences
If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign 2 bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including porti...
Entropy concepts and DNA investigations
Topological and metric entropies of DNA sequences from different organisms were calculated. The results were compared with each other and with those of corresponding artificial sequences. For all DNA sequences considered there is a maximum of heterogeneity, which falls in the block length interval [5,7]. The maximum distinction between natural and artificial sequences is shifted by 1-3 positions from...
THE ENTROPIES OF THE SEQUENCES OF FUZZY SETS AND THE APPLICATIONS OF ENTROPY TO CARDIOGRAPHY
In this paper, we first introduce the entropy of sequences of fuzzy sets and give some theorems about it. Second, the P and T waves that appear in electrocardiograms were transferred to fuzzy sets and, using the definition of entropy for sequences of fuzzy sets, some numerical values were obtained for sequences of P and T waves. Thus any person can make medical predictions for some cardiac...
Clustering of a Number of Genes Affecting in Milk Production using Information Theory and Mutual Information
Information theory is a branch of mathematics. Information theory is used in genetic and bioinformatics analyses and can be used for many analyses related to the biological structures and sequences. Bio-computational grouping of genes facilitates genetic analysis, sequencing and structural-based analyses. In this study, after retrieving gene and exon DNA sequences affecting milk yield in dairy ...
Estimates of the information content and dimensionality of natural scenes from proximity distributions.
Natural scenes, like most all natural data sets, show considerable redundancy. Although many forms of redundancy have been investigated (e.g., pixel distributions, power spectra, contour relationships, etc.), estimates of the true entropy of natural scenes have been largely considered intractable. We describe a technique for estimating the entropy and relative dimensionality of image patches ba...
Journal title:
- Journal of computational biology : a journal of computational molecular cell biology
Volume 6, Issue 1
Pages -
Publication date 1997